Overview

This session doesn’t assume any prior knowledge of R, and introduces the basics. For some students this will include revision of material from stage 1. However, we provide additional material for advanced students to test their knowledge and extend familiar skills.

To be clear, this repetition is intentional: we find most students will benefit from refreshing their knowledge at this stage in the course. Even if you are quite confident when using RStudio, please read the worksheet carefully and complete all of the activities in the blue boxes.

Using the RStudio interface

  • Access RStudio at https://rstudio.plymouth.ac.uk
  • Use the latest version of the Firefox web browser
  • Tell R what to do in the Console pane
  • See the Environment pane for stored data
  • Use the Files pane to open code and data from a folder on the server

No code in this video!

If you’re using Windows or an older Mac we strongly recommend downloading Firefox and using that. If you have any issues with RStudio this is likely the first suggestion we will make.

When you log in to RStudio, you’ll be greeted with a screen that looks something like the image below.

RStudio on first opening

You can see three parts:

  1. The Console - This is the large rectangle on the left. It is where you tell R what to do, and where R prints the answers to your questions.

  2. The Environment - This is the rectangle on the top right. It is where R keeps a list of the data it knows about. It’s empty at the moment, because we haven’t given R any data yet.

  3. The Files - This is the rectangle on the bottom right. It’s a bit like the File Explorer in Windows, or the Finder on a Mac. It shows you what files and folders R can see.

You should also be able to see that the two rectangles on the right have a number of other “tabs”. These work like tabs on a web browser.

The top rectangle has the tabs Environment and History. The History tab keeps a record of what you’ve recently typed into the Console. This can sometimes be useful.

The bottom rectangle has the tabs Files, Plots, Packages, Help, and Viewer. We’ll cover what these other tabs do later on.

Before you start

  • Before starting you must run some R code to get set up.
  • See the code tab or the exercise below.
# run an R script over the internet which will get you
# set up, and copy files you need to your home folder
source("https://raw.githubusercontent.com/benwhalley/lifesavR/main/bootstrap.R")

To get everyone off to the same start we have created a script that copies some files into your home folder on the RStudio server.

To run this script:

  1. Click on the Console pane.
  2. Copy-paste the following into the console:

source("https://raw.githubusercontent.com/benwhalley/lifesavR/main/bootstrap.R")

Your console should now look like this:

Press ↩︎ to run the code. If your console looks like the image below, then you are ready to start the session.

Using the workbooks

  • Each session has an associated “workbook” file
  • They end with the file extension ".rmd"
  • These were copied to your home folder by the bootstrap script (above)
  • Use them to complete the exercises in the worksheet

No code shown in this video

Each session has an associated “workbook” file which you will use to complete the exercises in the worksheet. The file you need for this session is called session-1.rmd.

If you click on the file it opens the workbook in a tab of a new pane, called the Source pane. It’s called the Source pane because statements written in the R language are often referred to as ‘R code’, which is shorthand for ‘R source code’. The Source pane allows you to write R code and explore your data.

Click on session-1.rmd in the Files pane.

If you’re able to open this file you are now ready to start the rest of the session.

What can R do?

  • R is a multi-purpose tool
  • It can do simple arithmetic, load data, make plots etc.
  • It can also run any statistical analysis you like
  • You need to tell R exactly what to do, by providing precise instructions
  • These instructions (code) provide a reproducible record of your work
# multiply two numbers
2 * 221

# generate some random numbers with a normal distribution
rnorm(10, 0,1)

# histogram plot of random numbers
hist(rnorm(100, 0,1))

R is a computer language for data analysis and visualisation.

RStudio is a user interface to R; it helps you organise your work.

R is a text-based language. You interact with it by typing commands and running (also called ‘executing’) them.

R can do everything from simple arithmetic and plotting to complex data analysis.

For example, you can do simple arithmetic like

2 * 221
[1] 442

We could generate some random numbers with a normal distribution

rnorm(10, 0,1)
 [1] -0.30718194  0.05030063 -0.76868554 -1.27331824  2.14723562  1.68826929 -0.38580193 -0.46802252  0.49992883
[10] -0.51172558

And we could plot random numbers using a histogram

hist(rnorm(100, 0,1))
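
The three numbers passed to rnorm() can be hard to read at a glance. As a small sketch, the same calls can be written with the argument names spelled out. The names n, mean and sd come from R's own documentation for rnorm(); the variable name random_numbers is ours, chosen only for illustration, and the <- arrow simply stores the result under that name.

```r
# ask for 10 random numbers from a normal distribution
# with a mean of 0 and a standard deviation of 1
random_numbers <- rnorm(n = 10, mean = 0, sd = 1)

# print them; your numbers will differ because they are random
random_numbers

# the same idea inside a histogram, this time with 100 numbers
hist(rnorm(n = 100, mean = 0, sd = 1))
```

Naming the arguments takes longer to type, but it makes it obvious which number is the count, which is the mean, and which is the standard deviation.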


You should think of R as a robot.

The robot is extremely fast, powerful and tireless; but it’s also literal-minded, and won’t think for itself or take the initiative. You need to tell it exactly what to do, by providing very precise instructions.

The advantage of writing detailed instructions is that you have a detailed, reproducible version of all your analyses.

Reproducibility is a key topic in psychology and other natural sciences — learning R (or something like it) is an important skill for new psychologists.

Working interactively in R Markdown

  • RMarkdown documents combine ‘chunks’ of R code with regular text
  • RMarkdown files end with “.rmd” or “.Rmd”
  • To make a chunk type: Ctrl + Alt + I (Windows/Linux) or ⌘ + Alt + I (Mac)

  • To run a line type Ctrl + ↵ (Windows/Linux) or ⌘ + ↩︎ (Mac)
  • To run part of a line, select it first, then use the same shortcut

  • Anything outside a chunk is just narrative (ordinary) text and not treated as code.

Backticks:

On Windows

On a Mac

No code shown in this video

RMarkdown documents are a good way to use R

RMarkdown is a file format which combines R code (chunks) with regular text.

RMarkdown can combine data analysis and graphs with explanatory text.

[SHOW EXAMPLE OF AN R MARKDOWN DOCUMENT.]

In the finished document, the code is evaluated and the results are interspersed with text.

This allows us to make high quality reports, research papers, dissertations or books.

Because it’s such a powerful tool, this module provides an early introduction to RMarkdown, although we don’t introduce all its features just yet.

For the moment, we’ll only be using Rmd documents as an interactive interface for running R code and looking at the results R produces.

R Markdown documents can be used interactively in RStudio

One neat feature of Rmd files is that, when you open them in RStudio, they make it easy to organise and run R code, and see the outputs.

If you click on the lifesavr folder in the Files pane of RStudio,

[CLICK ON FILES FOLDER IN VIDEO.]

you’ll notice that some files have the extension .rmd. These are R Markdown files.

[HIGHLIGHT FILE EXTENSION BY SELECTING OR POINTING WITH MOUSE.]

The file extension .rmd (or .Rmd) is important, because this is how RStudio knows that the files contain a mixture of R code and regular text.

Code chunks

RStudio needs to distinguish R code from regular narrative text.

This is done by putting the code inside some special characters, creating a chunk.

A chunk is opened using the symbols ```{r}, and closed using the symbols ```. This is what a chunk looks like in RStudio:

A code chunk in the RMarkdown editor


NOTE: The symbols which start and end a chunk are backticks, not single quotes. The difference is quite subtle.

Backticks are on your keyboard here if you’re on Windows:

On Windows

Or here if you’re on a Mac:

On a Mac

Running R code inside chunks

There are three ways to run R code within a chunk.

The first is to run a complete line of code.

You can see here that our cursor is on line 12. The cursor can be anywhere on that line. To run the line, press Ctrl + ↵ on Windows or Linux, or ⌘ + ↩︎ on a Mac.

Pressing these keys has run or executed that line of code.

You’ll see some output beneath the chunk that you don’t need to worry about for now, but one of the effects of running this code is to load a dataset about diamonds (prices, sizes, quality, etc).

Now that line 12 has been run, the cursor has been pushed down to line 13.

Lines 13 to 15 are actually part of the same statement — that is, R knows they are related to one another.

We use the same keys, Ctrl + ↵ (or ⌘ + ↩︎ on a Mac), to run these lines. This generates a scatter plot using the diamonds dataset. Don’t worry how these statements work for now; the point here is to show you that we can run code interactively using RMarkdown.

The second way to run code is to select only the parts you want to execute. If you select just the word diamonds on line 13 and run that, you will see that it does something different.

[SELECT DIAMONDS AND RUN IT.]

This prints the contents of the diamonds data. Because the dataset is large, it just prints the first few rows.


Finally, you often want to run all of the code in a chunk at once.

This can be done by pressing the green arrow on the right hand side of the chunk.

Another way to run all of the code is to position your cursor anywhere within the chunk and press Ctrl + Shift + ↵ (Windows, Linux) or ⌘ + Shift + ↩︎ (Mac).

Exercise 1

  1. Locate the first chunk in session-1.rmd (you can find this file in the Files pane)
  2. Place your cursor (anywhere) on the line that says library(tidyverse) (this code is explained in the next section)
  3. Run the code by pressing Ctrl + ↵ (Windows, Linux) or ⌘ + ↩︎ (Mac)

You will see some output appear beneath the chunk. Don’t worry about the details for now, we’ll explain those later.

Exercise 2

Position your cursor on the line that says diamonds and run the code.

You should see the following scatter plot of the diamonds data appear below the chunk:

Congratulations! You have just run your first lines of R. The code to produce the plot consisted of three lines. You can also run part of a line by highlighting just the code you want to run, as you’ll see in the next exercise.

Exercise 3

  1. Select (highlight) the word diamonds.
  2. Run the code.

This prints the first few lines of the diamonds data:

Example of running highlighted code

Exercise 4: Making new chunks

  1. Find the instructions for Exercise 4 in your workbook.
  2. Create a new chunk below the instructions.
  3. Inside the chunk, write a line of code which adds together the numbers 9, 4, 55 and 2.
  4. Run the line of code you have written.

The output from the chunk should look like this:

Result from Exercise 4

Loading packages

  • Loading a ‘package’ adds functionality to R
  • Some packages (like tidyverse and psydata) also include example datasets
  • To load tidyverse write library(tidyverse)
  • Load tidyverse and psydata before each session

The following R code is used in the video:

# load the tidyverse package
# (this also loads the diamonds example dataset, and some others)
library(tidyverse)

By loading ‘packages’, you can add extra functions and datasets to R.

Packages are a powerful feature which allow R to be extended. This means you can run almost any analysis, or make any type of plot.

Packages are loaded using the library() function.

The first function you ran above was library(tidyverse). This loaded additional functions you need to make a scatter plot.

The tidyverse package is so fundamental to this course that library(tidyverse) is likely to be the first line of R code, in the first chunk, in all your RMarkdown files.

It’s a good idea to load packages at the top of your R code files. This makes it easy to see which have been loaded, and avoids loading them twice which is occasionally a problem.

You do need to remember to actually run the lines of code to load libraries though. Beginners often forget to do this — but it’s an easy error to fix.

If you’ve understood what packages are then it should be clear that you can’t use the functions provided by tidyverse until you’ve run the command: library(tidyverse).

For example, if you tried to produce the scatter plot before loading tidyverse you’d see an error like this in the console:

Error in diamonds %>% ggplot(aes(carat, price, colour = clarity)) :
  could not find function "%>%"

This is important because could not find function errors are one of the most common problems that beginners encounter. They normally mean that you have

  1. forgotten to include library(tidyverse) as the first line in your code, or
  2. forgotten to run that line.
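
One way to check whether R can currently see a function is the built-in exists() function, sketched below. This is an optional extra rather than something used elsewhere in this worksheet, and the name no_such_function_here is made up so that the check fails on purpose.

```r
# rnorm is part of R itself, so a normal R session can find it
rnorm_found <- exists("rnorm")
rnorm_found

# this made-up name does not belong to any package, so R cannot find it;
# calling it would give a 'could not find function' error
made_up_found <- exists("no_such_function_here")
made_up_found
```

The first check gives TRUE and the second gives FALSE. In the same way, exists("glimpse") should only give TRUE once you have run library(tidyverse).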

Datasets

Datasets are like spreadsheets. They have:

  • multiple rows, with one row per observation
  • multiple columns; each column has a name.
  • columns also (sometimes) get called variables; this can be confusing

Where are datasets?

  • R has some built-in datasets as learning examples
  • The psydata package includes datasets used in this course
  • Later on, we will import data from files (e.g. actual spreadsheets)

Exploring and checking data

  • View a whole dataset by typing its name and running it in a code chunk
  • glimpse() shows a list of all the columns, plus a few of the datapoints
  • The Environment pane shows a spreadsheet-like view of the data
# always load the tidyverse first
library(tidyverse)

# the psydata package contains datasets for this course
library(psydata)

# display the `fuel` dataset, by typing the name
# and running this in a code chunk
fuel

# show only the first 6 rows of the `fuel` data
head(fuel)

# shows a list of columns in the `development` dataset
# plus the first few datapoints (as many as will fit)
glimpse(development)

Datasets are like spreadsheets: they are organised into columns and rows.

In R, datasets are normally stored in a container called a data.frame. They can also be stored in a tibble (these are basically the same thing).

Columns

Each column in a dataset has a name.

We sometimes call the columns variables, because each column will often relate to a variable in our study.

However, this can be a bit confusing because — in R — variables can actually contain whole datasets. fuel, for example, is the name of a variable which contains an example dataset, provided by the psydata package.

[SHOW LIBRARY(PSYDATA) AND THEN THE FUEL DATASET]

But these words are used flexibly and interchangeably, so we’ll just have to get used to it. It’s normally clear which type of variable we mean from the context.

Rows

Each row in a dataset represents an observation.

In different datasets an observation might correspond to an individual participant, a whole country, or even just a single button press in an experiment.
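
To make the ideas of rows, columns and variables concrete, here is a small sketch that builds a tiny dataset by hand. The dataset and its values are invented for illustration (they are not part of psydata). The functions data.frame(), names(), nrow() and ncol() are all built into R.

```r
# a made-up dataset: one row per participant, one column per measurement
reaction_times <- data.frame(
  participant = c("A", "B", "C"),
  seconds     = c(0.52, 0.61, 0.48)
)

# `reaction_times` is a variable (in the R sense) holding the whole dataset
reaction_times

# the column names: these are the 'variables' in the study sense
names(reaction_times)

# the number of rows (observations) and the number of columns
nrow(reaction_times)
ncol(reaction_times)
```

Here there are three observations (rows) and two columns, so the word ‘variable’ could refer either to reaction_times as a whole or to the seconds column, depending on the context.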

Packaged datasets

Some datasets are built into R packages as examples for beginners.

For this course, we created a package called psydata which includes the data we need for teaching.

This is installed on the RStudio server. To load it we run:

library(psydata)

We can see from the loading message that one of the datasets is called fuel. This contains data about cars — things like weight, fuel economy, engine size.

Let’s display this data using a new chunk. If we type the word fuel, select this variable name with our cursor, and ‘execute’ it, we can see the data it contains:

fuel
    mpg cyl engine_size power weight gear automatic
1  21.0   6        2620   110   1188    4      TRUE
2  21.0   6        2620   110   1304    4      TRUE
3  22.8   4        1770    93   1052    4      TRUE
4  21.4   6        4230   110   1458    3     FALSE
5  18.7   8        5900   175   1560    3     FALSE
6  18.1   6        3690   105   1569    3     FALSE
7  14.3   8        5900   245   1619    3     FALSE
8  24.4   4        2400    62   1447    4     FALSE
9  22.8   4        2310    95   1429    4     FALSE
10 19.2   6        2750   123   1560    4     FALSE
11 17.8   6        2750   123   1560    4     FALSE
12 16.4   8        4520   180   1846    3     FALSE
13 17.3   8        4520   180   1692    3     FALSE
14 15.2   8        4520   180   1715    3     FALSE
15 10.4   8        7730   205   2381    3     FALSE
16 10.4   8        7540   215   2460    3     FALSE
17 14.7   8        7210   230   2424    3     FALSE
18 32.4   4        1290    66    998    4      TRUE
19 30.4   4        1240    52    733    4      TRUE
20 33.9   4        1170    65    832    4      TRUE
21 21.5   4        1970    97   1118    3     FALSE
22 15.5   8        5210   150   1597    3     FALSE
23 15.2   8        4980   150   1558    3     FALSE
24 13.3   8        5740   245   1742    3     FALSE
25 19.2   8        6550   175   1744    3     FALSE
26 27.3   4        1290    66    878    4      TRUE
27 26.0   4        1970    91    971    5      TRUE
28 30.4   4        1560   113    686    5      TRUE
29 15.8   8        5750   264   1438    5      TRUE
30 19.7   6        2380   175   1256    5      TRUE
31 15.0   8        4930   335   1619    5      TRUE
32 21.4   4        1980   109   1261    4      TRUE

By default this shows the first ten rows and columns of the data. You can see other rows using the Next, Previous and number buttons below the data.

If your browser window is very narrow you may need to view some of the columns by using the arrow next to the final, right-hand column.

You can get information about the columns in all these example datasets by typing: help(name_of_the_dataset_you_want_to_know_about). For example:

help(fuel)
No documentation for 'fuel' in specified packages and libraries:
you could try '??fuel'

Exploring and checking data

There are three ways we recommend to inspect and check data you are using.

  1. Typing the name of the dataset, and running that as code
  2. The glimpse() function, which shows a list of all the columns and some of the data
  3. The head() function, which shows the first 6 rows

To use glimpse:

glimpse(fuel)
Rows: 32
Columns: 7
$ mpg         <dbl> 21.0, 21.0, 22.8, 21.4, 18.7, 18.1, 14.3, 24.4, 22.8, 19.2, 17.8, 16.4, 17.3, 15.2, 10.4, 10.4, 14…
$ cyl         <dbl> 6, 6, 4, 6, 8, 6, 8, 4, 4, 6, 6, 8, 8, 8, 8, 8, 8, 4, 4, 4, 4, 8, 8, 8, 8, 4, 4, 4, 8, 6, 8, 4
$ engine_size <dbl> 2620, 2620, 1770, 4230, 5900, 3690, 5900, 2400, 2310, 2750, 2750, 4520, 4520, 4520, 7730, 7540, 72…
$ power       <dbl> 110, 110, 93, 110, 175, 105, 245, 62, 95, 123, 123, 180, 180, 180, 205, 215, 230, 66, 52, 65, 97, …
$ weight      <dbl> 1188, 1304, 1052, 1458, 1560, 1569, 1619, 1447, 1429, 1560, 1560, 1846, 1692, 1715, 2381, 2460, 24…
$ gear        <dbl> 4, 4, 4, 3, 3, 3, 3, 4, 4, 4, 4, 3, 3, 3, 3, 3, 3, 4, 4, 4, 3, 3, 3, 3, 3, 4, 5, 5, 5, 5, 5, 4
$ automatic   <lgl> TRUE, TRUE, TRUE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FAL…

This shows a list of all the columns in the dataset, the type of data stored in each column, and as many observations (datapoints) as will fit on a single line.

glimpse is a really useful view to check which columns are available in a dataset before using them.

Using head:

head(fuel)
   mpg cyl engine_size power weight gear automatic
1 21.0   6        2620   110   1188    4      TRUE
2 21.0   6        2620   110   1304    4      TRUE
3 22.8   4        1770    93   1052    4      TRUE
4 21.4   6        4230   110   1458    3     FALSE
5 18.7   8        5900   175   1560    3     FALSE
6 18.1   6        3690   105   1569    3     FALSE

This prints the first 6 rows of all the columns. head() is really useful for checking the actual datapoints are as you expect before using them.
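
If you want a different number of rows, head() accepts a second argument, n. The sketch below uses mtcars, a dataset built into R, purely so that it runs without loading psydata; the same pattern should work with fuel or development. The names first_six and first_ten are ours.

```r
# with no second argument, head() returns the first 6 rows
first_six <- head(mtcars)
nrow(first_six)

# ask for the first 10 rows instead
first_ten <- head(mtcars, n = 10)
nrow(first_ten)
```

Typing first_ten on its own line and running it would print those ten rows.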

Why are we talking about cars not psychology?

In this course we mostly use very simple datasets, and some of them aren’t even about psychology.

Some students ask why we don’t always use psychological examples. If this hasn’t troubled you then you could skip to the next section, but we thought we should explain:

We think the fuel dataset (and others, like iris, and development) have a number of benefits.

First, they are either built into R, loaded in common packages, or available in the psydata package. This makes them easily available for everyone.

Second, these data relate to concrete, easy to understand phenomena (e.g. weight, length, number of gears). This means you don’t have to hold in mind any complex psychological/theoretical ideas for the examples to make sense.

Third, the relationships in these datasets are clear, and there aren’t too many data points. Real data are often more messy because many psychological constructs are hard to measure.

Our experience is that, when learning R, it pays to keep everything as simple as it possibly can be. The skills and concepts involved in analysing these data are the same though.

R, and the techniques and statistics we teach, are used right across the natural sciences.


If you’re still not convinced — don’t worry … we do include some clinical examples, and we will be collecting our own psychological data soon enough and analysing that.

Exercise 5

  1. Create a new chunk below the Exercise 5 heading in your workbook (session-1.rmd).
  2. Load the psydata package
  3. Display the fuel dataset and try out the navigation buttons.
  4. Make a list of columns in the development datset

The output should look like this:

The fuel dataset

Columns in the development dataset

Exercise 6

  1. Create a new chunk below the Exercise 6 heading in your workbook (session-1.rmd).
  2. Load the psydata package if you haven’t already done that in this work session
  3. Show the first 10 rows of the development data

Use the output to answer the following question. After entering your answer, click outside the box. The border will turn blue when the answer is correct.

The population of Afghanistan in 1967 was: .

Making a scatterplot with ggplot()

  • A scatterplot shows the relationship between two continuous variables (columns)
  • Each observation (row) must have at least two values (columns)
  • These define the position of a point on the x and y axes of the plot
  • Use ggplot()
  • aes(x = ..., y = ...) chooses the x and y data columns and creates the axes
  • geom_point() adds the points
# if you have not already, load these packages
library(tidyverse)
library(psydata)

# make a scatterplot from the fuel dataset
fuel %>%
  ggplot(aes(x=weight, y=mpg)) + # selects the columns to use
    geom_point()                 # adds the points to the plot



# the same plot
# this time we left out x= and y= in the aes code
# these are implicit in the order of weight and mpg
# the x axis comes first
fuel %>%
  ggplot(aes(weight, mpg)) +
    geom_point()         

A scatterplot shows the relationship between two variables by plotting their values as points on an x-axis (the left-right position) and a y-axis (the up-down position).

This code chunk creates a scatterplot. We start with the fuel data.

The %>% symbol is special: it’s called a ‘pipe’. We’ll explain more later, but for now just know that it sends the fuel data on to the next line of code, like it’s passing it down a pipe.

fuel %>%
  ggplot(aes(weight, mpg)) +
    geom_point()

[SHOW THE CODE, SELECT THE PIPE WITH CURSOR]

The second line receives the data. The ggplot() function tells us we are going to be making a plot.

The plot itself is built in two steps. The first step, ggplot(aes(weight, mpg)) selects columns in our dataset to use for the x and y axes. In this case, the x axis is weight, which is the weight of the cars in kg.

[SELECT WEIGHT IN CODE]

And mpg is miles per gallon, or fuel efficiency. This will be the y-axis.

We can see the plot if we put our cursor on the first line and press the shortcut — Ctrl or Cmd + Enter

[RUN THE PLOT AND SHOW]

As you’ve seen before, if we run code in an RMarkdown document then the result is shown underneath the chunk.
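
To see what the pipe is doing, it can help to compare piped code with the equivalent code written without it. This is a sketch using the built-in mtcars dataset (with its columns wt and mpg) so that it runs without psydata; the variable names piped_rows and direct_rows are ours.

```r
library(tidyverse)

# the pipe sends the data on its left into the function on its right,
# so these two lines do the same thing
piped_rows  <- mtcars %>% head()
direct_rows <- head(mtcars)
identical(piped_rows, direct_rows)

# the same applies to plots: both of these draw the same scatterplot
mtcars %>%
  ggplot(aes(wt, mpg)) +
    geom_point()

ggplot(mtcars, aes(wt, mpg)) +
  geom_point()
```

The piped version reads from top to bottom as "take the data, then make a plot", which is why we use it in these guides.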

Building plots in layers

A useful thing to know is that ggplot works by building up plots in multiple layers.

If we run just this part of the code, we can see the plot with just the axes, and no data shown.

[RUN JUST THE FIRST TWO LINES OF CODE BY SELECTING AND PRESSING CTRL+ENTER]

So, we make plots by:

  1. selecting data
  2. defining the axes, and then
  3. adding the data points

Each part of the plot is separated by a + symbol and goes on a new line.

RStudio is smart and knows all this is part of the same plot, so automatically indents the code.
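
The layering idea can be sketched by saving the unfinished plot under a name and adding to it afterwards. Saving plots in variables is not something this session requires, and the names axes_only and with_points are made up for this example; mtcars is used so that the code runs without psydata.

```r
library(tidyverse)

# steps 1 and 2: select the data and define the axes (no points yet)
axes_only <- mtcars %>%
  ggplot(aes(wt, mpg))

# show the empty axes
axes_only

# step 3: add a layer of points to the plot we saved
with_points <- axes_only + geom_point()

# show the finished scatterplot
with_points
```

Running axes_only shows a blank grid with labelled axes; running with_points shows the same grid with one point per row of the data.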

Cutting corners

There’s just one final thing to explain: In the previous code we wrote x = weight and y = mpg.

This makes things explicit, which is nice, but takes longer to type. You can also write the plot this way:

fuel %>%
  ggplot(aes(weight, mpg)) +
    geom_point()         

R assumes that the first variable is the x axis and the second is the y axis.

[SELECT X AXIS AND Y AXIS IN TURN WHEN DESCRIBING]

We use this style in these guides, and you should too.

Exercise 7

  1. Create a new chunk below the Exercise 7 heading in your workbook.
  2. Using the fuel dataset, create a scatterplot with engine_size on the x-axis and mpg (miles per gallon, or fuel economy) on the y-axis.
  3. Run the chunk.

The scatterplot should look like this:

Check your knowledge

Write an answer to each of these questions in the Check your knowledge section of your workbook. The answers will be revealed in Session 2.

  1. How do you run part of a line of R code using the keyboard short cut?
  2. Which library will you always need to load in your first R Markdown chunk?
  3. What is psydata?
  4. How would you look at/inspect a whole dataset?
  5. What does glimpse() do and when is it useful?
  6. What is the 5th column in the development dataset?
  7. Which function makes a plot? (there are many, but we mean the one shown above)
  8. Which function chooses the columns of data used in the plot?

Extension exercises

Please remember that these extension exercises are not required to pass the course. We include them because we find that some students work through these materials much more quickly than others (perhaps because they have more previous experience with programming), and we aim to give all students the opportunity to stretch their skills.

If you do find you have extra time, however, these exercises are intended to provide additional practice in the techniques taught here, and to be useful preparation for using R independently in a stage 4 or MSc research project.

Extension exercise 1

This scatterplot uses the fuel dataset to show a vehicle’s power on the x-axis against mpg on the y-axis.

In a new chunk, write the R code to produce this plot.

Extension exercise 2

There is another built-in dataset called iris which includes data about different flower species.

Use glimpse() to get a list of the column names.

Make a scatterplot which shows the relationships between petal widths and lengths.

Further reading

Scatterplots and visualisation: Fundamentals of Data Visualization is an excellent resource for data visualisation in R. This chapter: https://clauswilke.com/dataviz/visualizing-associations.html shows many examples of plots which display relationships between variables (including scatter plots) which would extend the material here.